
CLI: add eval-runner result diagnostics#3349

Open
chubes4 wants to merge 1 commit into trunk from eval-runner-result-diagnostics

Conversation

Contributor

@chubes4 chubes4 commented May 5, 2026

Summary

  • Capture final SDK result metadata in eval-runner output: resultSubtype, resultIsError, resultStopReason, resultText, and resultErrors.
  • Treat empty successful completions as failed evals when the SDK reports success but the run produced neither assistant text nor final result text.
  • Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1, with text/tool-result truncation to avoid bloating default artifacts.
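The reclassification in the second bullet can be sketched roughly as follows. This is an illustrative sketch only: the `SdkResult` shape, the `classifyEvalResult` helper, and the `assistantText` parameter are assumptions for demonstration, not the actual eval-runner code; only the five `result*` artifact field names come from this PR.

```typescript
// Assumed minimal shape of the final SDK result message (illustrative).
interface SdkResult {
  subtype: string;            // e.g. "success"
  isError: boolean;
  stopReason: string | null;  // e.g. "end_turn"
  text: string;               // final result text, possibly empty
  errors: string[];
}

// Scalar fields added to the eval artifact, plus a pass/fail verdict.
interface EvalOutcome {
  resultSubtype: string;
  resultIsError: boolean;
  resultStopReason: string | null;
  resultText: string;
  resultErrors: string[];
  passed: boolean;
}

// Hypothetical classifier: an SDK "success" that produced neither assistant
// text nor final result text is treated as a failed eval instead of being
// trusted on subtype alone.
function classifyEvalResult(result: SdkResult, assistantText: string): EvalOutcome {
  const emptySuccess =
    result.subtype === 'success' &&
    !result.isError &&
    assistantText.trim() === '' &&
    result.text.trim() === '';
  return {
    resultSubtype: result.subtype,
    resultIsError: result.isError,
    resultStopReason: result.stopReason,
    resultText: result.text,
    resultErrors: result.errors,
    passed: result.subtype === 'success' && !result.isError && !emptySuccess,
  };
}
```

Under this sketch, the GPT-5.5 tool-only runs described below (`subtype: "success"`, `stopReason: "end_turn"`, empty result, no assistant text) would classify as failed rather than successful.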

Why

This continues the eval-runner observability work from #3273 and #3330. Those PRs made phase/tool timings, first tool errors, loop exceptions, and timeouts visible. This adds the final SDK result shape and an opt-in turn transcript so eval consumers can distinguish model behavior, PI harness continuation behavior, runner classification, and downstream benchmark quality gates.

The need surfaced while testing the Static Site Importer draft path in #3309 with the Studio site-build benchmark. GPT-5.5 repeatedly produced tool-only runs for built-in prompt variants (restaurant, wordpress-is-dead): site_list / site_info returned successfully, then the SDK emitted subtype: "success", stopReason: "end_turn", and an empty result. No assistant text, Write, wp_cli, or import report was produced, but the eval runner classified the run as successful because it trusted message.subtype === 'success'.

With the local transcript diagnostics enabled, that failure shape was clear. The same diagnostics also helped compare Claude Sonnet 4.6 on the same SSI site-build flow: Claude generated source HTML and wrote files, then timed out while repairing generated helper-script errors before reaching import. Different failure mode, same need for better eval evidence.

This PR keeps the transcript opt-in so normal eval artifacts only gain a few scalar fields, while deeper debugging remains available when investigating model/runtime/harness regressions.
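The opt-in gating and truncation might look roughly like this. The `STUDIO_EVAL_INCLUDE_TRANSCRIPT=1` flag is from this PR, but the truncation limit, the `TranscriptTurn` shape, and the helper names are assumptions for illustration:

```typescript
// Assumed per-entry text cap; the PR does not state the actual limit.
const TRANSCRIPT_TEXT_LIMIT = 500;

// Truncate long text so transcripts don't bloat eval artifacts.
function truncateForTranscript(text: string, limit: number = TRANSCRIPT_TEXT_LIMIT): string {
  return text.length > limit ? text.slice(0, limit) + '…[truncated]' : text;
}

// Hypothetical compact transcript entry (role + truncated text).
interface TranscriptTurn {
  role: string;
  text: string;
}

// Only build the transcript when the opt-in env var is set, so default
// artifacts gain only the scalar result fields.
function maybeBuildTranscript(turns: TranscriptTurn[]): TranscriptTurn[] | undefined {
  if (process.env.STUDIO_EVAL_INCLUDE_TRANSCRIPT !== '1') return undefined;
  return turns.map((t) => ({ role: t.role, text: truncateForTranscript(t.text) }));
}
```

With the flag unset, `maybeBuildTranscript` returns `undefined` and the artifact shape is unchanged; with `STUDIO_EVAL_INCLUDE_TRANSCRIPT=1`, each turn is included with its text capped.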

Validation

  • npm install to bootstrap the clean worktree.
  • npm run cli:build --silent — passed.
  • npx eslint apps/cli/ai/eval-runner.ts — passed.
  • npm -w wp-studio run typecheck — passed.
  • git diff --check — passed.

AI assistance

  • AI assistance: Yes
  • Tool(s): OpenCode (GPT-5.5)
  • Used for: Diagnosing the eval-runner artifact gap during SSI benchmark runs, drafting the result metadata / opt-in transcript patch, running local validation, and preparing this PR body. Chris reviewed the failure evidence and PR framing.

@wpmobilebot
Collaborator

📊 Performance Test Results

Comparing 4fa17d4 vs trunk

app-size

| Metric | trunk | 4fa17d4 | Diff | Change |
| --- | --- | --- | --- | --- |
| App Size (Mac) | 1454.03 MB | 1454.03 MB | +0.00 MB | ⚪ 0.0% |

site-editor

| Metric | trunk | 4fa17d4 | Diff | Change |
| --- | --- | --- | --- | --- |
| load | 1522 ms | 1516 ms | 6 ms | ⚪ 0.0% |

site-startup

| Metric | trunk | 4fa17d4 | Diff | Change |
| --- | --- | --- | --- | --- |
| siteCreation | 8078 ms | 8078 ms | 0 ms | ⚪ 0.0% |
| siteStartup | 4946 ms | 4939 ms | 7 ms | ⚪ 0.0% |

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

@chubes4
Contributor Author

chubes4 commented May 6, 2026

Will update this after #3360 lands.

